Abstract

The concept of refinement from probability elicitation is considered for proper scoring rules. Tak- 
ing directions from the axioms of probability, refinement is further clarified using a Hilbert space 
interpretation and reformulated into the underlying data distribution setting where connections to 
maximal marginal diversity and conditional entropy are considered and used to derive measures 
that provide arbitrarily tight bounds on the Bayes error. Refinement is also reformulated into the 
classifier output setting and its connections to calibrated classifiers and proper margin losses are 
established. 

Keywords: Refinement Score, Probability Elicitation, Calibrated Classifier, Bayes Error Bound,
Conditional Entropy, Proper Loss

1. Introduction 

The concept of refinement can be traced back to a well known partition of the Brier (or quadratic)
score in early work by Murphy (1972), but it was explicitly defined and generalized for all proper
scoring rules in a series of seminal papers by DeGroot and Fienberg (1982, 1983). This concept is
also well known under different names depending on the literature. In the forecasting and
meteorology literature it is known as sharpness (Sanders, 1963; Gneiting et al., 2007) or
resolution (Bröcker, 2009), and in the probability elicitation literature (Savage, 1971) it is also
known as sufficiency (DeGroot and Fienberg, 1983, 1982; Schervish, 1989). This concept has also been
studied most recently in the meteorology and forecasting literature in papers such as Gneiting et al.
(2005), Wilks (2006), Gneiting et al. (2007) and Bröcker (2009).



Despite the fact that refinement is closely related to proper scoring rules and calibrated loss
functions, it has remained largely restricted to the probability elicitation and forecasting literature.
In this paper we first briefly review the concepts of calibration and refinement. The concept
of refinement will be emphasized and explained using the original work of DeGroot and Fienberg
(1982). We will then present three different yet closely interlocked arguments that will each
initially seem to refute the validity of the refinement concept but will instead, after a subtle
clarification, lead to the generalization of the refinement concept and establish its connections
to the Bayes error, to maximum marginal diversity and conditional entropy in feature selection
(Vasconcelos, 2002; Fleuret and Guyon, 2004; Peng et al., 2005; Vasconcelos and Vasconcelos, 2009),
and to classification with Bayes calibrated loss functions (Friedman et al., 2000; Zhang, 2004;
Buja et al., 2005; Masnadi-Shirazi and Vasconcelos, 2008; Reid and Williamson, 2010), among
others. Specifically, the original refinement definition in the probability elicitation setting will be
extended to the classifier output setting and the underlying data distribution setting.

A series of results is presented by extending refinement to the underlying data distribution
setting. These show that conditional entropy and maximum marginal diversity, as used in feature
selection, are special cases of refinement using the logistic score function. A number of other novel
refinement measures based on other score functions are derived, along with conditional refinement,
which can be used for feature selection and ranking. Refinement is also related to the Bayes error. A
number of well known bounds on the Bayes error, such as the Bhattacharyya bound (Fukunaga, 1990),
the asymptotic nearest neighbor bound (Fukunaga, 1990; Cover and Hart, 1967) and the
Jensen-Shannon divergence (Lin, 1991), are shown to be special cases of refinement measures. Other novel
bounds on the Bayes error are derived using the refinement interpretation, along with a method for
deriving arbitrarily tight bounds on the Bayes error.

Extending refinement to the classifier output setting allows for a statistically rigorous parallel
to the classifier margin, which we call classifier marginal density. It allows calibrated classifiers
to be ranked simply based on their outputs. We also show how each calibrated loss function
has a corresponding refinement measure and derive a number of such novel measures.

Refinement is also further studied in its original probability elicitation setting and a Hilbert 
space and inner product interpretation is provided. The inner product interpretation leads to further 
insight into refinement using different symmetric scoring rules. 

The paper is organized as follows. In Section 2 a review of the refinement concept in probability
elicitation is provided. In Section 3 the refinement concept is further analyzed from the perspective
of the axioms of probability, which leads to a novel refinement formulation in the underlying data
distribution setting and connections to the Bayes error. In Section 4 refinement and its connections
to maximal marginal diversity and conditional entropy are considered. Connections to calibrated
classifiers are considered in Section 5. In Sections 6, 7 and 8 refinement is further studied in its
original setting, the proposed classifier output setting and the underlying data distribution setting,
respectively. Finally, in Section 9 refinement in the underlying data distribution setting is used to
derive measures that provide arbitrarily tighter bounds on the Bayes error. Summary and conclusions
are provided in Section 10.



2. Refinement In Probability Elicitation 

In probability elicitation (Savage, 1971) a forecaster produces a probability estimate $\hat{\eta}$ of the
occurrence of event $y = 1$, where $y \in \{1, -1\}$, such as a weatherman predicting that it will rain
($y = 1$) tomorrow. $\eta = P(1|\hat{\eta})$ is the actual relative frequency of event $y = 1$ (rain) among those
days on which the forecaster's prediction was $\hat{\eta}$. A forecaster is said to be calibrated if
$\eta = \hat{\eta}$ for all $\hat{\eta}$, meaning that the weatherman is skilled and trustworthy. In other words, it
actually rains $\eta = \hat{\eta}$ percent of the time when he predicts that the chance of rain is $\hat{\eta}$.

It has been argued in DeGroot (1979), DeGroot and Fienberg (1982) and Dawid (1982) that a
calibrated forecaster is not necessarily a good forecaster or an informative and useful one, and that
another concept called refinement is also needed to evaluate forecasters. Intuitively, let $s(\hat{\eta})$ denote
the probability density function of the forecaster's predictions; then the more concentrated
$s(\hat{\eta})$ is around the values $\hat{\eta} = 0$ and $\hat{\eta} = 1$, the more refined the forecaster
is. To further demonstrate the concept of refinement, it is useful to consider the following slightly
modified example taken from DeGroot and Fienberg (1982). Consider two calibrated weather
forecasters A and B working at a location where the expected probability of rain is $\mu = 0.5$ on any
given day. Weatherman A is such that

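$$ s_A(\mu) = 1 \qquad (1) $$

$$ s_A(\hat{\eta}) = 0 \quad \text{for } \hat{\eta} \neq \mu \qquad (2) $$

and weatherman B is such that

$$ s_B(1) = \mu \qquad (3) $$

$$ s_B(0) = 1 - \mu \qquad (4) $$

$$ s_B(\hat{\eta}) = 0 \quad \text{for } \hat{\eta} \neq 0, 1. \qquad (5) $$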

Both forecasters can be calibrated. To demonstrate this, assume that both weathermen make 100
predictions. Weatherman A predicts that the chance of rain is $\hat{\eta} = 0.5$ all the time. If it actually
rains as expected on 50 days we have $\eta = \frac{50}{100} = 0.5$, so $\eta = \hat{\eta}$ and A is calibrated. In the case of
weatherman B the predictions are 1) chance of rain is $\hat{\eta} = 1$ on 50 days and 2) chance of rain is
$\hat{\eta} = 0$ on the other 50 days. If it actually rains on the 50 days B predicted rain then $\eta = \frac{50}{50} = 1$, and
if it actually does not rain when B predicted no rain then $\eta = \frac{0}{50} = 0$. In either case $\eta = \hat{\eta}$ and B is
also calibrated.

Although we have shown that both A and B are calibrated forecasters, it is acceptable to say
that the forecasts made by A are useless, while forecaster B is the ideal weatherman in the sense
that he only makes definite predictions, that the chance of rain is 0 or that the chance of rain is 1,
and is always correct. Forecaster A, on the other hand, always makes the conservative but useless
prediction that the chance of rain is 0.5. We say that weatherman A is the least-refined forecaster
and that weatherman B is the most-refined forecaster (DeGroot and Fienberg, 1982). This leads to the
argument that well calibrated forecasters can be compared based on their refinement (DeGroot and
Fienberg, 1983).

Before providing a formal measure of refinement, proper scoring functions need to be introduced.
A scoring function is such that a score of $I_1(\hat{\eta})$ is attained if the forecaster predicts $\hat{\eta}$ and
event $y = 1$ actually happens, and a score of $I_{-1}(\hat{\eta})$ is attained if event $y = -1$ happens. $I_1(\hat{\eta})$ is
an increasing function of $\hat{\eta}$ and $I_{-1}(\hat{\eta})$ is decreasing in $\hat{\eta}$. Since the relative frequency with which
the forecaster makes the prediction $\hat{\eta}$ is $s(\hat{\eta})$, the expected score of the forecaster over all $\hat{\eta}$ and $y$ is

$$ \int_{\hat{\eta}} s(\hat{\eta}) \left[ \eta I_1(\hat{\eta}) + (1 - \eta) I_{-1}(\hat{\eta}) \right] d(\hat{\eta}), \qquad (6) $$

and the expected score for a given $\hat{\eta}$ is

$$ I(\eta, \hat{\eta}) = \eta I_1(\hat{\eta}) + (1 - \eta) I_{-1}(\hat{\eta}). \qquad (7) $$

The score function is denoted as strictly proper if $I_1(\hat{\eta})$ and $I_{-1}(\hat{\eta})$ are such that the expected
score of (7) is maximized when $\hat{\eta} = \eta$, or in other words

$$ I(\eta, \hat{\eta}) \le I(\eta, \eta) = J(\eta). \qquad (8) $$

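It can be shown (Savage, 1971) that a score function is strictly proper if and only if the maximal
reward function $J(\eta)$ is strictly convex and

$$ I_1(\eta) = J(\eta) + (1 - \eta) J'(\eta) \qquad (9) $$

$$ I_{-1}(\eta) = J(\eta) - \eta J'(\eta). \qquad (10) $$

A formal definition of refinement can be provided when considering the proper scoring function
$I_y$. The expected score $S_{I_y}$ can be written as

$$ S_{I_y} = \int_{\hat{\eta}} s(\hat{\eta}) \left( P(1|\hat{\eta}) I_1(\hat{\eta}) + P(-1|\hat{\eta}) I_{-1}(\hat{\eta}) \right) d(\hat{\eta}). \qquad (11) $$

By simply adding and subtracting $s(\hat{\eta}) [P(1|\hat{\eta}) I_1(\eta) + P(-1|\hat{\eta}) I_{-1}(\eta)]$, we can dissect any
expected score $S_{I_y}$ of a forecaster into two parts, $S_{Calibration}$ and $S_{Refinement}$, that are measures of
calibration and refinement (DeGroot and Fienberg, 1983):

$$
\begin{aligned}
S_{I_y} &= \int_{\hat{\eta}} s(\hat{\eta}) \left( P(1|\hat{\eta}) I_1(\hat{\eta}) + P(-1|\hat{\eta}) I_{-1}(\hat{\eta}) \right) d(\hat{\eta}) \\
&= \int_{\hat{\eta}} s(\hat{\eta}) \left[ P(1|\hat{\eta}) \{ I_1(\hat{\eta}) - I_1(\eta) \} + P(-1|\hat{\eta}) \{ I_{-1}(\hat{\eta}) - I_{-1}(\eta) \} \right] d(\hat{\eta}) \\
&\quad + \int_{\hat{\eta}} s(\hat{\eta}) \left[ P(1|\hat{\eta}) I_1(\eta) + P(-1|\hat{\eta}) I_{-1}(\eta) \right] d(\hat{\eta}) \\
&= S_{Calibration} + S_{Refinement}.
\end{aligned}
\qquad (12)
$$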



Recall that $I(\eta, \hat{\eta}) \le I(\eta, \eta) = J(\eta)$, so that $S_{Calibration}$ has a maximum equal to zero when
the forecaster is calibrated ($\hat{\eta} = \eta$) and is negative otherwise.
The second term $S_{Refinement}$ can be simplified to

$$
\begin{aligned}
S_{Refinement} &= \int_{\hat{\eta}} s(\hat{\eta}) \left[ P(1|\hat{\eta}) I_1(\eta) + P(-1|\hat{\eta}) I_{-1}(\eta) \right] d(\hat{\eta}) \\
&= \int_{\hat{\eta}} s(\hat{\eta}) \left[ \eta I_1(\eta) + (1 - \eta) I_{-1}(\eta) \right] d(\hat{\eta}) \\
&= \int_{\hat{\eta}} s(\hat{\eta}) J(\eta) d(\hat{\eta}).
\end{aligned}
\qquad (13)
$$

Note that $J(\eta) = J(P(1|\hat{\eta}))$ is a convex function of $\eta$ over the $[0, 1]$ interval. Intuitively, the
more concentrated $s(\hat{\eta})$ is near 0 and 1, the larger the $s(\hat{\eta}) J(\eta)$ term will become. In other words,
$S_{Refinement}$ will increase as $\hat{\eta}(x)$ becomes more refined (DeGroot and Fienberg, 1983). We will
formalize this and present the inner product interpretation of refinement in Section 6.
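The Savage representation in (9) and (10) can be checked numerically. The following is a minimal sketch, assuming the logistic maximal reward $J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$ as the example; the function names and the grid of candidate forecasts are illustrative choices and not part of the original derivation.

```python
import math

def J(eta):
    """Logistic maximal reward, used here only as an example of a strictly convex J."""
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def J_prime(eta):
    """Derivative of the logistic J."""
    return math.log(eta) - math.log(1 - eta)

def I_pos(eta_hat):
    """Score received when y = 1, built from J as in equation (9)."""
    return J(eta_hat) + (1 - eta_hat) * J_prime(eta_hat)

def I_neg(eta_hat):
    """Score received when y = -1, built from J as in equation (10)."""
    return J(eta_hat) - eta_hat * J_prime(eta_hat)

def expected_score(eta, eta_hat):
    """Expected score I(eta, eta_hat) of equation (7)."""
    return eta * I_pos(eta_hat) + (1 - eta) * I_neg(eta_hat)

# Search a grid of candidate forecasts for the one with the highest expected score.
eta = 0.7
grid = [k / 100 for k in range(1, 100)]
best_forecast = max(grid, key=lambda eta_hat: expected_score(eta, eta_hat))
```

Under these assumptions the scores reduce to $I_1(\hat{\eta}) = \log(\hat{\eta})$ and $I_{-1}(\hat{\eta}) = \log(1 - \hat{\eta})$, and the grid search selects the calibrated forecast $\hat{\eta} = \eta$, in agreement with (8).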

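As an example, consider the strictly proper Brier score (BS) (or least squares)
$I_{y'} = -(\hat{\eta} - y')^2$, where $y' = \frac{y + 1}{2}$. The score is written with a negative sign so that larger
values are better, consistent with (8). Its expected score can be expressed as a measure of calibration
and refinement (Murphy, 1972; DeGroot and Fienberg, 1983)

$$
\begin{aligned}
S_{BS} &= -\int_{\hat{\eta}} s(\hat{\eta}) \left( P(1|\hat{\eta}) (\hat{\eta} - 1)^2 + P(-1|\hat{\eta}) \hat{\eta}^2 \right) d(\hat{\eta}) \\
&= -\int_{\hat{\eta}} s(\hat{\eta}) \left( \hat{\eta} - P(1|\hat{\eta}) \right)^2 d(\hat{\eta}) + \int_{\hat{\eta}} s(\hat{\eta}) P(1|\hat{\eta}) \left( P(1|\hat{\eta}) - 1 \right) d(\hat{\eta}) \\
&= S_{Calibration} + S_{Refinement}.
\end{aligned}
\qquad (14)
$$

The expected score $S_{I_y}$ is maximized when the forecaster is calibrated, $\hat{\eta} = P(1|\hat{\eta}) = \eta$, and the
distribution of predictions $s(\hat{\eta})$ is mostly concentrated around 0 and 1, since
$P(1|\hat{\eta}) (P(1|\hat{\eta}) - 1) = \eta (\eta - 1)$ is a symmetric convex function of $\eta$ on this interval with minimum
at $\eta = \frac{1}{2}$ and maximums at $\eta = 0$ and $\eta = 1$ (DeGroot and Fienberg, 1983).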
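The Brier decomposition can be applied to the two weathermen of the earlier example. The sketch below assumes the Brier score is taken in its negated form (larger is better) and that each forecaster issues only finitely many distinct predictions, so the integrals become sums; the triple format `(s, eta_hat, eta)` is a hypothetical encoding chosen for illustration.

```python
def brier_decomposition(forecasts):
    """Split the expected (negated) Brier score into calibration and refinement.

    forecasts: list of (s, eta_hat, eta) triples, where s is the relative
    frequency with which the prediction eta_hat is issued and eta is the
    observed frequency of rain on the days with that prediction.
    """
    calibration = -sum(s * (eta_hat - eta) ** 2 for s, eta_hat, eta in forecasts)
    refinement = sum(s * eta * (eta - 1) for s, eta_hat, eta in forecasts)
    return calibration, refinement

# Weatherman A always predicts 0.5 and it rains on half of the days.
weatherman_a = [(1.0, 0.5, 0.5)]
# Weatherman B predicts 1 on half of the days and 0 on the rest, and is always right.
weatherman_b = [(0.5, 1.0, 1.0), (0.5, 0.0, 0.0)]

calibration_a, refinement_a = brier_decomposition(weatherman_a)
calibration_b, refinement_b = brier_decomposition(weatherman_b)
```

Both forecasters attain the maximal calibration term of zero, while the refinement term is $-0.25$ for A and $0$ for B, which matches the description of A as least refined and B as most refined.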



3. Further Analysis Of The Refinement Concept 

In this section we present a series of three arguments from different angles that further clarify and 
extend the concept of refinement. The first is an argument based on Cox's theorem on subjective 
probability that basically points out a subtle yet important flaw in the assumptions that might be 
made in understanding the refinement concept. 

3.1 Argument based on the basic desiderata of probability 

A forecaster is simply producing subjective probabilities. It is well understood that subjective
probabilities are based on the axioms of Cox's theory, which are elegantly presented as the desiderata
of probability in the form of three logical statements in Jaynes and Bretthorst (2003). It is the failure
to strictly follow these requisites that has led to many unnecessary errors, paradoxes and
controversies in probability. Here we show that the concept of refinement might seem to contradict the
third desideratum of probability, namely that of consistency, if not presented correctly. This requires
that 1) if a conclusion can be reasoned out in more than one way, then every possible way must
lead to the same result, 2) the forecaster always takes into account all the evidence it has relevant
to the question and does not arbitrarily ignore some of the information and 3) if in two problems
the forecaster's state of knowledge is the same, then it must assign the same probabilities in both. It
is also important to note that subjective probabilities and their logic do not depend on the person
or machine making them. Anyone who has the same information but comes to a different probability
assignment is necessarily violating one of the desiderata of probability (Jaynes and Bretthorst,
2003).

Ignoring the above requisites can lead to a misunderstanding or contradiction when considering
the concept of refinement. This can best be presented with an example similar to that in Section 2.
Assume that two calibrated forecasters A and B have access to the same information. For the sake of
argument we assume this to be data $x$ in the form of air pressure readings, which is a good indicator
for predicting rain. Also assume that the actual probability of rain given air pressure $x$ is known to
be $P(1|x) = 0.7$. In terms of forecasters, the consistency property requires that each $x$ lead to a
corresponding forecast $\hat{\eta}$ (and $\eta$) and that no $x$ lead to more than one forecast $\hat{\eta}$ (and $\eta$). In other
words, $\hat{\eta}$ and $\eta$ are functions of the information $x$, such that we can write $\hat{\eta}(x)$ and $\eta(x)$.

Let forecaster A make the prediction that the chance of rain is $\hat{\eta}_A(x) = 1$ and forecaster B make
the prediction that the chance of rain is $\hat{\eta}_B(x) = 0.7$. It might initially seem that forecaster A is more
refined than forecaster B, but in fact the consistency principle of probability elicitation is being
violated. In other words, since both forecasters are basing their forecasts on the same information
$x$, they should both make identical predictions.

We extend the concept of forecasters and require two more reasonable properties from a
forecaster. First, a forecaster should be responsive. In other words, different information must lead to
a different forecast. Formally, we require that if the information $x_1 \neq x_2$ then $\hat{\eta}(x_1) \neq \hat{\eta}(x_2)$ and
$\eta(x_1) \neq \eta(x_2)$. This is equivalent by definition to requiring that $\hat{\eta}(x)$ and $\eta(x)$ be one-to-one
functions. Second, a forecaster should be encompassing, so that any forecast is possible. Formally,
there exists a corresponding $x$ for any $\hat{\eta}$ and $\eta$. This is equivalent by definition to requiring that $\hat{\eta}(x)$
and $\eta(x)$ be onto functions. Both required properties can be summarized by equivalently requiring
that $\hat{\eta}(x)$ and $\eta(x)$ be invertible functions. The immediate consequence of invertibility is that

$$ \eta(x) = P(1|\hat{\eta}(x)) = P(1|x). \qquad (15) $$

This in turn leads to another contradiction in the example above, meaning that forecaster A is
not actually calibrated. If, as stated, $P(1|x) = 0.7$ then $\eta(x) = P(1|\hat{\eta}(x)) = P(1|x) = 0.7$, while
forecaster A predicted $\hat{\eta}_A(x) = 1 \neq \eta(x)$, i.e. forecaster A is not calibrated as initially claimed.
Forecaster B, on the other hand, is verifiably calibrated.

3.2 Extending refinement to the underlying data distribution setting

The discussion and example presented in Section 3.1 suggest that a forecaster and its measure of
refinement depend on the underlying data distribution $P(1|x)$ from which the forecasts are
established. This can be formally presented by writing the expected score as

$$
\begin{aligned}
S_{I_y} &= \int_{\hat{\eta}} s(\hat{\eta}) \sum_y P(y|\hat{\eta}) I_y(\hat{\eta}) d(\hat{\eta}) \\
&= \int_x \frac{P_X(x)}{\hat{\eta}'} \sum_y P(y|x) I_y(\hat{\eta}(x)) (\hat{\eta}' dx) \\
&= \int_x P_X(x) \sum_y P(y|x) I_y(\hat{\eta}(x)) dx,
\end{aligned}
\qquad (16)
$$

where $\hat{\eta}' = \hat{\eta}'(x) = \frac{d\hat{\eta}(x)}{dx}$ and we have made use of the change of variable theory from calculus and
function of random variable theory from probability theory. Using this theory demands that $\hat{\eta}(x)$ be
an invertible function, as previously required for a forecaster.

The refinement term (13) can also be similarly reduced to

$$
\begin{aligned}
S_{Refinement} &= \int_{\hat{\eta}} s(\hat{\eta}) \left[ P(1|\hat{\eta}) I_1(\eta) + P(-1|\hat{\eta}) I_{-1}(\eta) \right] d(\hat{\eta}) \\
&= \int_x \frac{P_X(x)}{\hat{\eta}'} \left[ \frac{P(x|1)}{P_X(x)} P(1) I_1(\eta) + \frac{P(x|-1)}{P_X(x)} P(-1) I_{-1}(\eta) \right] (\hat{\eta}' dx) \\
&= \int_x \left[ P(x|1) P(1) I_1(\eta) + P(x|-1) P(-1) I_{-1}(\eta) \right] dx \\
&= \int_x P_X(x) \sum_y P(y|x) I_y(\eta) dx \\
&= \int_x P_X(x) J(\eta) dx \\
&= \int_x P_X(x) J(P(1|x)) dx.
\end{aligned}
\qquad (17)
$$

The above formulation shows that the distribution of forecasts $s(\hat{\eta})$ in the original refinement
formulation (13) reduces to $P_X(x)$, which is the distribution of the data. This means that the refinement
of a forecaster has nothing to do with how good the forecaster is but depends on the distribution of
the underlying data itself, which is outside the control of the forecaster. Given observations $x$, the
best a forecaster can do is be calibrated. This can also be seen by noting that refinement is a constant
term independent of the forecaster predictions $\hat{\eta}$ and only depends on the distribution of the data.
This observation leads us to make a connection with the Bayes rule in decision theory, which we
explore in the next section.



3.3 Refinement and the Bayes rule 

We can think of a forecaster as a kind of classifier that tries to classify days into rainy or sunny.
We again assume that the forecaster/classifier has access to a set of observations $x$, for example
air pressure. What is the optimal decision a forecaster can make? The Bayes rule tells us that the
optimal decision is to choose rainy if $P(1|x) > P(-1|x)$ and sunny otherwise; or equivalently the
forecaster's prediction should be that the chance of rain is $\hat{\eta} = P(1|x)$. This, by definition, is simply
the requirement of a calibrated forecaster, $\hat{\eta} = P(1|x) = \eta$.

This can also be written as: choose rainy if $\frac{P(x|1) P(1)}{P(x)} > \frac{P(x|-1) P(-1)}{P(x)}$. Assuming no prior
knowledge of the chance of rain on any given day, so that $P(1) = P(-1)$, we can write: choose rainy
if $P(x|1) > P(x|-1)$. We see that the optimal decision rule depends only on the distribution of the
data $P(x|y)$. Given that two forecasters have access to the same air pressure readings, the best
forecast they can each give on any given day depends only on the distributions $P(x|1)$ and $P(x|-1)$
and is simply $\hat{\eta} = P(1|x) = \frac{P(x|1)}{P(x|1) + P(x|-1)}$.
Given equal access to data $x$, both forecasters will make identical predictions. A central part of
Bayes decision theory is the Bayes error. We will return to the subject of Bayes error and its
connections to refinement in Section 8.



3.4 Clarifying the refinement concept 

At this point, given the three arguments above, it is evident that the concept of refinement can only
be meaningful when comparing forecasters that use different types of evidence or data to form their
predictions. For example, it could be that one forecaster uses $x_1$, air pressure, and another uses
an unrelated type of data such as $x_2$, water temperature, to make their predictions. In this case both
forecasters can be calibrated but one can be more refined than the other, since their data distributions
$P(x_1)$ and $P(x_2)$ are different.

In summary, for a fixed data type $x_1$ with distribution $P(x_1)$, the best forecaster possible is the
calibrated forecaster, and all other forecasters that base their predictions on this type of data $x_1$ can at
best be identical to the calibrated forecaster. The only way to improve on the forecaster's predictions
is to use a different type of data or feature $x_2$ with a different distribution $P(x_2)$, resulting in a
calibrated forecaster that has higher refinement. This brings us to the notion of feature selection and
its connections to refinement, which we explore in the next section.
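The comparison of two data types can be sketched numerically using the data distribution form of refinement, $\int_x P_X(x) J(P(1|x)) dx$. The example below assumes discrete (histogram) features so that the integral becomes a sum, and uses the logistic $J$; the three-bin histograms for air pressure and water temperature are invented purely for illustration.

```python
import math

def logistic_J(eta):
    """Logistic maximal reward, with the convention 0 * log(0) = 0."""
    if eta <= 0.0 or eta >= 1.0:
        return 0.0
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def refinement(p_x_given_pos, p_x_given_neg, prior_pos, J=logistic_J):
    """Refinement sum_x P(x) J(P(1|x)) of a discrete feature x."""
    total = 0.0
    for p_pos, p_neg in zip(p_x_given_pos, p_x_given_neg):
        p_x = p_pos * prior_pos + p_neg * (1 - prior_pos)
        if p_x > 0:
            posterior = p_pos * prior_pos / p_x  # P(1|x) by Bayes rule
            total += p_x * J(posterior)
    return total

# Hypothetical class-conditional histograms P(x|1) and P(x|-1) over three bins.
pressure_refinement = refinement([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], prior_pos=0.5)
water_temp_refinement = refinement([0.4, 0.3, 0.3], [0.3, 0.3, 0.4], prior_pos=0.5)
```

With these made-up histograms the air pressure feature yields the higher (less negative) refinement, since its class-conditional distributions overlap less. No forecaster appears in the computation: only $P(x|y)$ and the prior enter, which is the point of the argument above.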



4. Refinement, Maximum Marginal Diversity And Conditional Entropy 



In this section we show that conditional entropy and maximum marginal diversity (Vasconcelos, 2002,
2003) are both special cases of the extended concept of refinement in the underlying data distribution
setting when considering the logistic maximal reward function $J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$.



4.1 Refinement and Maximum Marginal Diversity 

The principle of maximum marginal diversity (Vasconcelos, 2002, 2003) is studied in feature
selection and states that for a classification problem with observations drawn from a random variable
$Z \in \mathcal{Z}$ and feature transformations $X_i : \mathcal{Z} \to \mathcal{X}_i$, the best feature transformation is the one that
leads to a set of maximally diverse marginal densities, where the marginal diversity for each feature
is defined as

$$ md(X_i) = \sum_{y \in \{1, -1\}} P_Y(y) D_{KL}(P(x_i|y) || P(x_i)). \qquad (18) $$



In other words, the best feature to use for classification is the one that has the highest $md(X_i)$.
Choosing a feature $X_i$ with maximally diverse marginal density is equivalent to choosing a feature
with the highest refinement using the logistic $J(\eta)$ function. This can be shown by writing (17) as

$$
\begin{aligned}
S_{Refinement} &= \int_x \left[ P(x|1) P_Y(1) I_1(P(1|x)) + P(x|-1) P_Y(-1) I_{-1}(P(1|x)) \right] dx \\
&= \int_x \left[ P(x|1) P_Y(1) I_1\!\left( \frac{P(x|1) P_Y(1)}{P(x)} \right) + P(x|-1) P_Y(-1) I_{-1}\!\left( \frac{P(x|1) P_Y(1)}{P(x)} \right) \right] dx.
\end{aligned}
\qquad (19)
$$



For the special case where $J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$, such that $I_1(\eta) = \log(\eta)$ and
$I_{-1}(\eta) = \log(1 - \eta)$, we have

$$
\begin{aligned}
S_{Refinement} &= \int_x \left[ P(x|1) P_Y(1) \log\!\left( \frac{P(x|1) P_Y(1)}{P(x)} \right) + P(x|-1) P_Y(-1) \log\!\left( \frac{P(x|-1) P_Y(-1)}{P(x)} \right) \right] dx \\
&= P_Y(1) D_{KL}(P(x|1) || P(x)) + P_Y(-1) D_{KL}(P(x|-1) || P(x)) \\
&\quad + P_Y(1) \log(P_Y(1)) + P_Y(-1) \log(P_Y(-1)).
\end{aligned}
\qquad (20)
$$

Assuming that $P_Y(1) = \gamma$ we can write

$$
\begin{aligned}
S_{Refinement} &= P_Y(1) D_{KL}(P(x|1) || P(x)) + P_Y(-1) D_{KL}(P(x|-1) || P(x)) \\
&\quad + P_Y(1) \log(P_Y(1)) + P_Y(-1) \log(P_Y(-1)) \\
&= md(x) + \gamma \log(\gamma) + (1 - \gamma) \log(1 - \gamma)
\end{aligned}
\qquad (21)
$$

and maximum marginal diversity is equivalent, up to a constant, to the refinement formula for the
special case of $J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$. One consequence of this equivalence
is that, in the case of probability elicitation, the best a forecaster can
do using a certain feature such as $x$ = air pressure is to be calibrated; increased refinement can only
come from using better features such as, perhaps, $x$ = air humidity. The insight gained in terms of
feature selection is that the KL-divergence is not unique and that other valid $J(\eta)$ functions, such as
those in Table 4 and plotted in Figure 4 and Figure 5, can be used to find refinement formulations
as seen in Table 3 and Table 5. The question that still remains is how different convex $J(\eta)$ differ
in terms of their feature selection properties. We consider this problem in Sections 8 and 9.

4.2 Refinement, mutual information and conditional entropy 

Refinement also has a close relationship with mutual information and conditional entropy. From
(21) we write refinement for the special case of $J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$ as

$$
\begin{aligned}
S_{Refinement} &= P_Y(1) D_{KL}(P(x|1) || P(x)) + P_Y(-1) D_{KL}(P(x|-1) || P(x)) \\
&\quad + P_Y(1) \log(P_Y(1)) + P_Y(-1) \log(P_Y(-1)) \\
&= \sum_y P_Y(y) D_{KL}(P(x|y) || P(x)) + \sum_y P_Y(y) \log(P_Y(y)) \\
&= I(x; y) + \sum_y P_Y(y) \log(P_Y(y)) \\
&= (H(y) - H(y|x)) - H(y) = -H(y|x)
\end{aligned}
\qquad (22)
$$

where $I(x; y)$ is the mutual information and $H(y|x)$ is the conditional entropy. This shows that
conditional entropy is a special case of the refinement score when the logistic $J(\eta)$ is used. Note
that a higher refinement is a number that is less negative, which corresponds to a lower conditional
entropy. In other words, if $y$ is completely determined by $x$ then the conditional entropy will be zero,
which corresponds to maximum refinement.
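The chain of equalities relating refinement, marginal diversity and conditional entropy can be verified on a small discrete example. The sketch below assumes a three-bin feature with invented class-conditional histograms and a prior $P_Y(1) = \gamma = 0.4$; all variable names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class-conditional histograms and prior.
p_x_pos = [0.7, 0.2, 0.1]   # P(x|1)
p_x_neg = [0.1, 0.3, 0.6]   # P(x|-1)
gamma = 0.4                 # P_Y(1)

p_x = [gamma * a + (1 - gamma) * b for a, b in zip(p_x_pos, p_x_neg)]

# Marginal diversity: prior-weighted KL divergences to the marginal.
md = gamma * kl_divergence(p_x_pos, p_x) + (1 - gamma) * kl_divergence(p_x_neg, p_x)
prior_term = gamma * math.log(gamma) + (1 - gamma) * math.log(1 - gamma)

# Refinement computed directly as sum_x P(x) J(P(1|x)) with the logistic J.
refinement = 0.0
for a, px in zip(p_x_pos, p_x):
    eta = gamma * a / px
    refinement += px * (eta * math.log(eta) + (1 - eta) * math.log(1 - eta))

# Conditional entropy H(y|x) from the joint distribution P(x, y).
cond_entropy = 0.0
for a, b, px in zip(p_x_pos, p_x_neg, p_x):
    for joint in (gamma * a, (1 - gamma) * b):
        cond_entropy -= joint * math.log(joint / px)
```

Up to floating point error, `refinement` equals `md + prior_term` and also equals `-cond_entropy`, so ranking features by marginal diversity, by logistic refinement or by negative conditional entropy gives the same ordering when the prior is fixed.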

Refinement can be directly used for feature selection and is closely related to conditional mutual
information or conditional entropy conditioned on two or more variables. Ranking all features by
their refinement score is not very useful, because this does not take into account the dependencies
that exist between the features. Simply using the $n$ highest ranked features is usually a bad
idea, since most of the first few features will be redundant, related and dependent. We would like to
choose the second feature $z$ such that it not only provides information for classifying the class $y$, but
is also complementary to the previously chosen feature $x$. This can be accomplished by considering
the conditional refinement score defined as

$$ S_{ConditionalRefinement} = \sum_{x,z} P(x, z) J(P(1|x, z)). \qquad (23) $$

Conditional entropy is a special case of conditional refinement when the logistic
$J(\eta) = \eta \log(\eta) + (1 - \eta) \log(1 - \eta)$ is used:

$$
\begin{aligned}
S_{ConditionalRefinement} &= \sum_{x,z} P(x, z) J(P(1|x, z)) \\
&= \sum_{x,z} P(x, z) \left[ P(1|x, z) \log(P(1|x, z)) + (1 - P(1|x, z)) \log(1 - P(1|x, z)) \right] \\
&= \sum_{x,z} P(x, z) \sum_y P(y|x, z) \log(P(y|x, z)) \\
&= \sum_{x,z,y} P(x, z, y) \log(P(y|x, z)) \\
&= -H(y|x, z).
\end{aligned}
\qquad (24)
$$



A more practical formula for conditional refinement can be written as

$$
\begin{aligned}
S_{ConditionalRefinement} &= \sum_{x,z} P(x, z) J(P(1|x, z)) \\
&= \sum_{x,y,z} P(z|x, y) P(x|y) P(y) \log(P(y|x, z)) \\
&= \sum_{x,y,z} P(z|x, y) P(x|y) P(y) \log\!\left( \frac{P(z|x, y) P(x|y) P(y)}{P(z|x) P(x)} \right)
\end{aligned}
\qquad (25)
$$

where we have used the logistic $J(\eta)$ to demonstrate. The above formula iteratively picks the
best feature conditioned on the previously chosen feature. Note that all the distributions can be
estimated with one dimensional histograms. Conditional entropy has been successfully used in
Fleuret and Guyon (2004) as the basis of a feature selection algorithm that has been shown to
outperform boosting (Freund and Schapire, 1997; Friedman et al., 2000) and other classifiers on the
datasets considered. Finally, note that, similar to refinement, different conditional refinement scores
can be derived for different choices of convex $J(\eta)$.
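The conditional refinement score of (23) can be used to compare candidate second features, as in the following sketch. It assumes binary features, the logistic $J$, and a joint distribution supplied as a dictionary keyed by `(x, z, y)`; the two candidates (a redundant copy of $x$ and a feature that is conditionally independent of $x$ given $y$) are hypothetical and chosen only to show the effect of redundancy.

```python
import math
from itertools import product

def logistic_J(eta):
    """Logistic maximal reward, with the convention 0 * log(0) = 0."""
    if eta <= 0.0 or eta >= 1.0:
        return 0.0
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)

def conditional_refinement(joint):
    """Compute sum_{x,z} P(x,z) J(P(1|x,z)); joint maps (x, z, y) -> P(x, z, y)."""
    total = 0.0
    for x, z in {(x, z) for x, z, _ in joint}:
        p_pos = joint.get((x, z, 1), 0.0)
        p_xz = p_pos + joint.get((x, z, -1), 0.0)
        if p_xz > 0:
            total += p_xz * logistic_J(p_pos / p_xz)
    return total

# Hypothetical model: equal priors, P(x=1|y) and P(z=1|y) as below.
p_x1 = {1: 0.8, -1: 0.2}
p_z1 = {1: 0.7, -1: 0.3}

def bernoulli(p_one, value):
    """Probability of a binary value given the probability of a one."""
    return p_one if value == 1 else 1 - p_one

# Candidate 1: z is an exact copy of x, so it carries no new information.
redundant = {(x, x, y): 0.5 * bernoulli(p_x1[y], x)
             for x in (0, 1) for y in (1, -1)}

# Candidate 2: z is conditionally independent of x given the class.
complementary = {(x, z, y): 0.5 * bernoulli(p_x1[y], x) * bernoulli(p_z1[y], z)
                 for x, z, y in product((0, 1), (0, 1), (1, -1))}

score_redundant = conditional_refinement(redundant)
score_complementary = conditional_refinement(complementary)
```

In this toy setting the redundant candidate leaves the score at the refinement of $x$ alone, while the complementary candidate raises it, so a greedy selection based on (23) would prefer the complementary feature.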

5. Refinement And Calibrated Classifiers 

Probability elicitation and classification by way of conditional risk minimization are closely related
and have been most recently studied in Friedman et al. (2000), Zhang (2004), Buja et al. (2005),
Masnadi-Shirazi and Vasconcelos (2008) and Reid and Williamson (2010). A classifier $h$ maps a
feature $x \in \mathcal{X}$ to a class label $y \in \{-1, 1\}$. This mapping can be written as $h(x) = sign[p(x)]$ for a
classifier predictor function $p : \mathcal{X} \to \mathbb{R}$. A predictor function is called an optimal predictor $p^*(x)$
if it minimizes the risk

$$ R(p) = E_{X,Y}[L(p(x), y)] \qquad (26) $$

for a given loss $L(p(x), y)$. This is equivalent to minimizing the conditional risk
$E_{Y|X}[L(p(x), y) | X = x]$ for all $x \in \mathcal{X}$. Classification can be related to probability elicitation by
expressing the predictor as a composite of two functions

$$ p(x) = f(\hat{\eta}(x)) \qquad (27) $$

where $f : [0, 1] \to \mathbb{R}$ is called the link function. The problem of finding the predictor function is
now equivalent to finding the link and forecaster functions. A link function is called an optimal link
function $f^*(\eta)$ if it is a one-to-one mapping and also implements the Bayes decision rule, meaning
that it must be such that

$$
\begin{cases}
f^* > 0 & \text{if } \eta(x) > \frac{1}{2} \\
f^* = 0 & \text{if } \eta(x) = \frac{1}{2} \\
f^* < 0 & \text{if } \eta(x) < \frac{1}{2}.
\end{cases}
\qquad (28)
$$

Examples of optimal link functions include $f^* = 2\eta - 1$ and $f^* = \log \frac{\eta}{1 - \eta}$, where we have omitted
the dependence on $x$ for simplicity.

A predictor is denoted calibrated (DeGroot and Fienberg, 1983; Platt, 2000; Niculescu-Mizil and
Caruana, 2005; Gneiting and Raftery, 2007) if it is optimal, i.e. minimizes the risk of (26), and an
optimal link function exists such that

$$ \eta(x) = (f^*)^{-1}(p^*(x)) = \hat{\eta}(x). \qquad (29) $$

The loss $L(p(x), y)$ associated with a calibrated predictor is called a proper loss function.

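In a classification algorithm a proper loss function is usually fixed beforehand. The associated
conditional risk is

$$ C_L(\eta, f) = \eta L(f, 1) + (1 - \eta) L(f, -1), \qquad (30) $$

the optimal link function is typically found from

$$ f^*_L(\eta) = \arg\min_f C_L(\eta, f) \qquad (31) $$

and the minimum conditional risk is

$$ C^*_L(\eta) = C_L(\eta, f^*_L). \qquad (32) $$

For example, in the case of the zero-one loss

$$
L_{0/1}(f, y) =
\begin{cases}
0, & \text{if } y = sign(f); \\
1, & \text{if } y \neq sign(f),
\end{cases}
\qquad (33)
$$

the associated conditional risk is

$$
C_{0/1}(\eta, f) =
\begin{cases}
1 - \eta, & \text{if } f \ge 0; \\
\eta, & \text{if } f < 0,
\end{cases}
\qquad (34)
$$

the optimal link can be $f^* = 2\eta - 1$ or $f^* = \log \frac{\eta}{1 - \eta}$, and the minimum conditional risk is

$$ C^*_{0/1}(\eta) = \min\{\eta, 1 - \eta\}. \qquad (35) $$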

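Margin losses are a special class of loss functions commonly used in classification algorithms,
which are in the form of

$$ L_\phi(f, y) = \phi(yf). \qquad (36) $$

Table 1: Proper margin loss $\phi(v)$, optimal link $f^*_\phi(\eta)$, optimal inverse link $(f^*_\phi)^{-1}(v)$ and maximal reward $J(\eta)$.

Loss | $\phi(v)$ | $f^*_\phi(\eta)$ | $(f^*_\phi)^{-1}(v)$ | $J(\eta)$
-----|-----------|------------------|----------------------|----------
LS (Zhang, 2004) | $\frac{1}{2}(1 - v)^2$ | $2\eta - 1$ | $\frac{1}{2}(1 + v)$ | $-2\eta(1 - \eta)$
Exp (Zhang, 2004) | $\exp(-v)$ | $\frac{1}{2} \log \frac{\eta}{1 - \eta}$ | $\frac{e^{2v}}{1 + e^{2v}}$ | $-2\sqrt{\eta(1 - \eta)}$
Log (Zhang, 2004) | $\log(1 + e^{-v})$ | $\log \frac{\eta}{1 - \eta}$ | $\frac{e^{v}}{1 + e^{v}}$ | $\eta \log \eta + (1 - \eta) \log(1 - \eta)$
Savage (Masnadi-Shirazi and Vasconcelos, 2008) | $\frac{4}{(1 + e^{v})^2}$ | $\log \frac{\eta}{1 - \eta}$ | $\frac{e^{v}}{1 + e^{v}}$ | $-4\eta(1 - \eta)$
Tangent (Masnadi-Shirazi et al., 2010) | $(2 \arctan(v) - 1)^2$ | $\tan(\eta - \frac{1}{2})$ | $\arctan(v) + \frac{1}{2}$ | $-4\eta(1 - \eta)$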

Margin loss functions assign a nonzero penalty to positive $yf$, called the margin. Algorithms such as
boosting (Freund and Schapire, 1997; Friedman et al., 2000) are based on proper margin loss
functions and have, not surprisingly, demonstrated superior performance given their consistency with the
Bayes optimal decision rule (Friedman et al., 2000; Buja et al., 2005; Masnadi-Shirazi and Vasconcelos,
2008). Table 1 includes some examples of proper margin losses along with their associated optimal
links and minimum conditional risks.
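The relation between a margin loss, its optimal link and its minimum conditional risk can be checked numerically. The sketch below assumes the standard closed forms of the five margin losses and links named in Table 1 (written out explicitly in the code, so it does not depend on the table layout), evaluates the conditional risk of (30) at the stated link, and compares it with a brute-force grid search over $f$; the grid range and the value $\eta = 0.7$ are arbitrary illustrative choices.

```python
import math

def log_odds(eta):
    return math.log(eta / (1 - eta))

# name -> (phi(v), optimal link f*(eta), maximal reward J(eta)); assumed standard forms.
LOSSES = {
    "LS": (lambda v: 0.5 * (1 - v) ** 2,
           lambda eta: 2 * eta - 1,
           lambda eta: -2 * eta * (1 - eta)),
    "Exp": (lambda v: math.exp(-v),
            lambda eta: 0.5 * log_odds(eta),
            lambda eta: -2 * math.sqrt(eta * (1 - eta))),
    "Log": (lambda v: math.log(1 + math.exp(-v)),
            log_odds,
            lambda eta: eta * math.log(eta) + (1 - eta) * math.log(1 - eta)),
    "Savage": (lambda v: 4 / (1 + math.exp(v)) ** 2,
               log_odds,
               lambda eta: -4 * eta * (1 - eta)),
    "Tangent": (lambda v: (2 * math.atan(v) - 1) ** 2,
                lambda eta: math.tan(eta - 0.5),
                lambda eta: -4 * eta * (1 - eta)),
}

def conditional_risk(phi, eta, f):
    """C(eta, f) = eta * phi(f) + (1 - eta) * phi(-f) for a margin loss."""
    return eta * phi(f) + (1 - eta) * phi(-f)

eta = 0.7
grid = [k / 1000 for k in range(-3000, 3001)]  # candidate predictor values f
results = {}
for name, (phi, link, J) in LOSSES.items():
    risk_at_link = conditional_risk(phi, eta, link(eta))
    grid_minimum = min(conditional_risk(phi, eta, f) for f in grid)
    results[name] = (risk_at_link, grid_minimum, -J(eta))
```

For each loss the three numbers in `results` agree to within the grid resolution: the stated link attains the minimum conditional risk, and that minimum equals $-J(\eta)$.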

The score functions $I_1$, $I_{-1}$ and the maximal reward function $J(\eta)$ can be related to proper margin
losses and the minimum conditional risk by considering the following theorem (Masnadi-Shirazi and
Vasconcelos, 2008), which states that if $J(\eta)$ defined as in (8) is such that

$$ J(\eta) = J(1 - \eta) \qquad (37) $$

and a continuous function $f^*_\phi(\eta)$ is invertible with symmetry

$$ (f^*_\phi)^{-1}(-v) = 1 - (f^*_\phi)^{-1}(v), \qquad (38) $$

then the functions $I_1$ and $I_{-1}$ derived from (9) and (10) satisfy the following equalities

$$ I_1(\eta) = -\phi(f^*_\phi(\eta)) \qquad (39) $$

$$ I_{-1}(\eta) = -\phi(-f^*_\phi(\eta)) \qquad (40) $$

with

$$ \phi(v) = -J[(f^*_\phi)^{-1}(v)] - (1 - (f^*_\phi)^{-1}(v)) J'[(f^*_\phi)^{-1}(v)]. \qquad (41) $$

An important direct result of the above theorem is that $J(\eta) = -C^*_\phi(\eta)$.

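Table 2: The $J((f^*_\phi)^{-1}(v))$, the domain of $v$ over which it is defined and the corresponding $J(\eta)$ and $(f^*_\phi)^{-1}(v)$.

Loss | $J((f^*_\phi)^{-1}(v))$ | $(f^*_\phi)^{-1}(v)$ | $J(\eta)$ | Domain
-----|-------------------------|----------------------|-----------|-------
Zero-One-A | $-\min\{\frac{v+1}{2}, 1 - \frac{v+1}{2}\}$ | $\frac{v+1}{2}$ | $-\min\{\eta, 1 - \eta\}$ | $[-1, 1]$
Zero-One-B | $-\min\{\frac{e^{v}}{1+e^{v}}, 1 - \frac{e^{v}}{1+e^{v}}\}$ | $\frac{e^{v}}{1+e^{v}}$ | $-\min\{\eta, 1 - \eta\}$ | $[-\infty, \infty]$
LS | $\frac{1}{2}(v^2 - 1)$ | $\frac{v+1}{2}$ | $-2\eta(1 - \eta)$ | $[-1, 1]$
Exp | $\frac{-e^{v}}{1+e^{2v}}$ | $\frac{e^{2v}}{1+e^{2v}}$ | $-\sqrt{\eta(1 - \eta)}$ | $[-\infty, \infty]$
Log | $0.7213[\frac{v e^{v}}{1+e^{v}} - \log(1 + e^{v})]$ | $\frac{e^{v}}{1+e^{v}}$ | $0.7213[\eta \log(\eta) + (1 - \eta) \log(1 - \eta)]$ | $[-\infty, \infty]$
Savage | $\frac{-2e^{v}}{(1+e^{v})^2}$ | $\frac{e^{v}}{1+e^{v}}$ | $-2\eta(1 - \eta)$ | $[-\infty, \infty]$
Tangent | $2(\arctan(v))^2 - \frac{1}{2}$ | $\arctan(v) + \frac{1}{2}$ | $-2\eta(1 - \eta)$ | $[-\tan(\frac{1}{2}), \tan(\frac{1}{2})]$

The above discussion connects refinement to the classification setting, and we can write
refinement in terms of the calibrated classifier outputs $v = f(\hat{\eta})$. Specifically, assuming a calibrated
classifier based on a proper loss function we have

$$ v = p(x) = f(\hat{\eta}(x)) = f^*_\phi(\eta(x)) \qquad (42) $$

and

$$ \hat{\eta}(x) = \eta(x) = (f^*_\phi)^{-1}(v), \qquad (43) $$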

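and the refinement term can be written as

$$\begin{aligned}
S_{\text{Refinement}} &= \int_{\hat{\eta}} s(\hat{\eta})\, J(\eta)\, d(\hat{\eta}) \\
&= \int_{v} s((f^*_\phi)^{-1}(v))\, J((f^*_\phi)^{-1}(v))\, d((f^*_\phi)^{-1}(v)) \\
&= \int_{v} s(v)\, J((f^*_\phi)^{-1}(v))\, d(v).
\end{aligned} \tag{44}$$

For the special case of the proper margin Log loss of Table 1 we have

$$J(\eta) = 0.7213[\eta\log(\eta) + (1-\eta)\log(1-\eta)] \tag{45}$$

and

$$\eta = (f^*_\phi)^{-1}(v) = \frac{e^v}{1+e^v} \tag{46}$$

so $J((f^*_\phi)^{-1}(v)) = J(\eta)$ can be simplified to

$$J((f^*_\phi)^{-1}(v)) = 0.7213\left[\frac{v e^v}{1+e^v} - \log(1+e^v)\right] \tag{47}$$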



and is plotted in Figure 1. The $J((f^*_\phi)^{-1}(v))$ functions associated with the proper margin losses of Table 1 are presented in Table 2 and plotted in Figure 1. Refinement for the Log loss can be written as

$$S_{\text{Refinement}} = \int_v s(v)\, J((f^*_\phi)^{-1}(v))\, d(v) = 0.7213 \int_v s(v)\left[\frac{v e^v}{1+e^v} - \log(1+e^v)\right] d(v) \tag{48}$$

where we reiterate that $s(v)$ is the distribution of the classifier's predictions.
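A minimal sketch of how (48) might be estimated in practice: if the classifier outputs $v_i$ are treated as samples from $s(v)$, the integral can be approximated by a sample average. The helper name `log_refinement` and the synthetic Gaussian outputs are assumptions made purely for illustration.

```python
import numpy as np

C = 1.0 / (2.0 * np.log(2.0))  # roughly 0.7213

def J_of_output(v):
    # Equation (47): J((f*)^{-1}(v)) = C [ v e^v / (1 + e^v) - log(1 + e^v) ]
    v = np.asarray(v, dtype=float)
    softplus = np.logaddexp(0.0, v)            # log(1 + e^v), numerically stable
    sigmoid = np.exp(-np.logaddexp(0.0, -v))   # e^v / (1 + e^v), numerically stable
    return C * (v * sigmoid - softplus)

def log_refinement(outputs):
    # Monte Carlo estimate of (48): average of J over samples drawn from s(v)
    return float(np.mean(J_of_output(outputs)))

rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=10_000)
# Hypothetical classifier outputs: one set close to v = 0, one set far from it
near_boundary = signs * 0.2 + 0.1 * rng.standard_normal(10_000)
far_from_boundary = signs * 4.0 + 0.1 * rng.standard_normal(10_000)

ref_near = log_refinement(near_boundary)
ref_far = log_refinement(far_from_boundary)
print(ref_near, ref_far)
```

Under these assumptions the outputs concentrated far from $v = 0$ yield a refinement closer to zero than those clustered near the boundary.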

All of the plotted $J((f^*_\phi)^{-1}(v))$ functions in Figure 1 are quasi convex. This is shown to always be the case by considering the derivative

$$\frac{dJ((f^*_\phi)^{-1}(v))}{dv} = \frac{dJ((f^*_\phi)^{-1}(v))}{d(f^*_\phi)^{-1}(v)} \cdot \frac{d(f^*_\phi)^{-1}(v)}{dv} \tag{49}$$

and the fact that $(f^*_\phi)^{-1}(v)$ is a nondecreasing invertible function and $J(\eta)$ is convex. Since $\frac{d(f^*_\phi)^{-1}(v)}{dv} \geq 0$ and $J'((f^*_\phi)^{-1}(v))$ changes sign only once, the derivative of $J((f^*_\phi)^{-1}(v))$ also changes sign only once, proving that $J((f^*_\phi)^{-1}(v))$ is quasi convex.

Figure 1: Plot of $J((f^*_\phi)^{-1}(v))$ for different loss functions.
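The quasi-convexity argument can also be checked numerically. The sketch below composes $J(\eta)$ with a few inverse links, compares two of the compositions against their closed forms, and counts sign changes of the finite-difference derivative. The helper `sign_changes` and the specific inverse links used here (taken to be $(v+1)/2$ for LS, $e^v/(1+e^v)$ for Log and $\arctan(v)+\frac{1}{2}$ for Tangent) are illustrative assumptions.

```python
import numpy as np

C = 1.0 / (2.0 * np.log(2.0))  # roughly 0.7213

def J_quad(eta):
    # J(eta) = -2 eta (1 - eta), shared by the LS and Tangent losses
    return -2.0 * eta * (1.0 - eta)

def J_log(eta):
    return C * (eta * np.log(eta) + (1.0 - eta) * np.log(1.0 - eta))

# (name, J, assumed inverse link, grid of classifier outputs v)
t = np.tan(0.5)
cases = [
    ("LS", J_quad, lambda v: (v + 1.0) / 2.0, np.linspace(-1.0, 1.0, 401)),
    ("Log", J_log, lambda v: 1.0 / (1.0 + np.exp(-v)), np.linspace(-6.0, 6.0, 401)),
    ("Tangent", J_quad, lambda v: np.arctan(v) + 0.5, np.linspace(-t, t, 401)),
]

def sign_changes(values):
    # Number of sign changes in the finite-difference derivative, ignoring zeros
    d = np.diff(values)
    s = np.sign(d[np.abs(d) > 1e-14])
    return int(np.sum(s[1:] != s[:-1]))

results = {}
for name, J, inv_link, v in cases:
    composed = J(inv_link(v))
    results[name] = (composed, v, sign_changes(composed))
    print(name, "sign changes:", results[name][2])
```

A single sign change in the derivative (decreasing, then increasing) is what the quasi-convexity claim predicts for each composition.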

Given the quasi convex shape of $J((f^*_\phi)^{-1}(v))$, the refinement of a classifier increases when the distribution of the classifier predictions $s(v)$ is concentrated away from the decision boundary. A classifier with predictions that are concentrated further away from the boundary is preferable and the refinement of a classifier can be thought of as a measure of the classifier's marginal density. This observation is formally considered in Section 7 and allows for the comparison of calibrated classifiers based solely on the distribution of their predictions.

We note that although the concept of classifier marginal density seems to be related to maximum margin theory (Vapnik, 1998) in classifier design, there are a few key differences. 1) Calibrated classifiers that are built from the same underlying data distribution will have the same classifier marginal density but can have different margins. The notion of margins is thus in contradiction to the axioms of probability theory, while the concept of classifier marginal density is not. 2) Margins are only defined for completely separable data, while classifier marginal density does not have such restrictions. 3) While the margin of a classifier considers only the data that lie on the margin, the notion of classifier marginal density considers the entire spread and distribution of the data.

6. Further Insight Into Refinement In The Original Probability Elicitation Setting

The original refinement formulation is in the probability elicitation setting and was formulated as

$$S_{\text{Refinement}} = \int_{\hat{\eta}} s(\hat{\eta})\, J(\eta)\, d(\hat{\eta}) \tag{50}$$

in Section 2. As mentioned previously, it is intuitive that refinement increases as the distribution of the predictions $s(\hat{\eta})$ concentrates around $\hat{\eta} = 0$ and $\hat{\eta} = 1$. We formalize this intuition in this section and derive the maximum and minimum refinement scores using an inner product Hilbert space interpretation.
Real continuous functions $f(x)$ and $g(x)$ that are also square integrable form an inner product Hilbert space (Luenberger, 1969) where the inner product is defined as

$$\langle f, g \rangle = \int f(x)\, g(x)\, d(x) \tag{51}$$

with induced norm of

$$\|f\|^2 = \int (f(x))^2\, d(x). \tag{52}$$

Let $\int |J(\eta)|^2\, d(\eta) < \infty$ and $\int |s(\hat{\eta})|^2\, d(\hat{\eta}) < \infty$; then $J(\eta)$ and $s(\hat{\eta})$ are square integrable functions. The inner product associated with this inner product Hilbert space is

$$\langle s, J \rangle = \int s(\hat{\eta})\, J(\eta)\, d(\hat{\eta}), \tag{53}$$

which is equal to the original refinement formulation of (50). In other words, refinement computes the inner product between the two functions $J(\eta)$ and $s(\hat{\eta})$. As seen in Table 1, $J(\eta) \leq 0$, and $s(\hat{\eta}) \geq 0$ since $s(\hat{\eta})$ is a probability distribution function. This constrains the refinement score to $\langle s, J \rangle \leq 0$. The maximum and minimum refinement scores for a fixed $J(\eta)$ can now be computed by considering the inner product between a fixed $J(\eta)$ and a distribution of prediction functions $s(\hat{\eta})$. Specifically, the minimum refinement score is

$$S^{\min}_{\text{Refinement}} = \langle s, J \rangle = \|s\| \cdot \|J\| \cdot \cos(\theta) = a\|J\| \cdot \|J\| \cdot (-1) = -a\|J\|^2 \tag{54}$$

and corresponds to when $s = -aJ$ for some multiple $a$. The maximum refinement score is

$$S^{\max}_{\text{Refinement}} = \langle s, J \rangle = \|s\| \cdot \|J\| \cdot \cos(\theta) = \|s\| \cdot \|J\| \cdot (0) = 0 \tag{55}$$

and corresponds to when $s \perp J$.

Usually, the score functions $I_1$ and $I_{-1}$ are chosen to be symmetric such that $I_1(\eta) = I_{-1}(1 - \eta)$ so that the scores attained for predicting either class $y \in \{1, -1\}$ remain class insensitive. In this case the corresponding $J(\eta)$ is also symmetric such that $J(\eta) = J(1 - \eta)$. This can be confirmed by noting that

$$J(1-\eta) = (1-\eta)\, I_1(1-\eta) + (1-1+\eta)\, I_{-1}(1-\eta) = (1-\eta)\, I_{-1}(\eta) + \eta\, I_1(\eta) = J(\eta). \tag{56}$$

When $J(\eta) \leq 0$ is convex symmetric over $\eta \in [0, 1]$ then $J(\eta)$ is minimum at $\eta = \frac{1}{2}$ and $J(0) = J(1) = 0$, and the maximum refinement score verifiably corresponds to when all of the predictions are either 0 or 1 such that $s(\hat{\eta}) = \gamma\delta(\hat{\eta}) + (1-\gamma)\delta(1-\hat{\eta})$ where $0 \leq \gamma \leq 1$. The $s(\hat{\eta})$ pertaining to the cases of maximum and minimum refinement are plotted for a hypothetical symmetric $J(\eta)$ in Figure 2.

Figure 2: Plot of $s(\hat{\eta})$ for maximum (left) and minimum (right) refinement with a hypothetical $J(\eta)$.
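The inner product view can be illustrated on a grid. The sketch below, which assumes $J(\eta) = -2\eta(1-\eta)$ and three hypothetical prediction densities, checks that $\cos(\theta) = -1$ when $s = -aJ$ (so that $\langle s, J \rangle = -\|s\| \cdot \|J\|$) and that a density with its mass near $\hat{\eta} = 0$ and $\hat{\eta} = 1$ has refinement close to zero. The helper names and the choice of densities are assumptions for illustration only.

```python
import numpy as np

def integrate(y, x):
    # Trapezoidal rule on a one-dimensional grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

eta = np.linspace(0.0, 1.0, 20001)
J = -2.0 * eta * (1.0 - eta)  # a symmetric convex J with J(0) = J(1) = 0

def inner(s, g):
    return integrate(s * g, eta)

def norm(g):
    return float(np.sqrt(inner(g, g)))

def normalize(s):
    # Rescale a nonnegative function so that it integrates to one
    return s / integrate(s, eta)

densities = {
    "aligned": normalize(-J),                 # s = -aJ
    "uniform": normalize(np.ones_like(eta)),
    "peaked": normalize(np.exp(-eta / 0.01) + np.exp(-(1.0 - eta) / 0.01)),  # mass near 0 and 1
}

refinement = {}
cosine = {}
for name, s in densities.items():
    refinement[name] = inner(s, J)
    cosine[name] = refinement[name] / (norm(s) * norm(J))
    print(name, refinement[name], cosine[name])
```

Note that the comparison is in terms of the angle between $s$ and $J$; the densities here are normalized to integrate to one, so their norms differ.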



7. Further Insight Into Refinement In The Classifier Output Setting 

In Section 5 we stated that when considering refinement in the classifier output setting under the formulation of (44), refinement increases as $s(v)$ is concentrated away from the boundary. This can be formally addressed by letting $s(v)$ and $J((f^*_\phi)^{-1}(v))$ be square integrable functions that form an inner product Hilbert space with inner product

$$\langle s, J \rangle = \int_v s(v)\, J((f^*_\phi)^{-1}(v))\, d(v). \tag{57}$$

This is equal to the refinement formulation of (44) associated with the classifier output setting. An argument similar to that of Section 6 leads to the conclusion that refinement in the classifier output setting is minimum when the distribution of classifier outputs is $s(v) = -aJ((f^*_\phi)^{-1}(v))$, increases as $s(v)$ concentrates away from the decision boundary and is maximum when $s(v) = \lim_{t \to \infty} \gamma\delta(v - t) + (1-\gamma)\delta(v + t)$ where $0 \leq \gamma \leq 1$.



8. Further Insight Into Refinement In The Underlying Data Distribution Setting 

In an earlier section we showed that the refinement score can be reduced to the underlying data distribution setting as

$$S_{\text{Refinement}} = \int_x P_X(x)\, J(P(1|x))\, dx. \tag{58}$$

Here we expand on this formulation and formalize its connections to the Bayes error and eventually derive novel measures that provide arbitrarily tighter bounds on the Bayes error.

First we show that refinement in the data distribution setting is also an inner product in a Hilbert space, with inner product defined as

$$\langle P_X, J \rangle = \int_x P_X(x)\, J(P(1|x))\, dx. \tag{59}$$

This follows directly from letting $P_X(x)$ and $J(P(1|x))$ be square integrable functions, which is not a stringent constraint since most probability density functions are square integrable (Tang et al., 2000). We also directly show that $J(P(1|x)) \leq 0$ is quasi convex over $x$. This follows from

$$\frac{dJ(P(1|x))}{dx} = \frac{dJ(P(1|x))}{dP(1|x)} \cdot \frac{dP(1|x)}{dx} = J'(P(1|x))\, \frac{dP(1|x)}{dx}, \tag{60}$$

the fact that $\eta(x) = P(1|x)$ is an invertible and hence monotonic function, as established earlier, and the fact that $J(\eta)$ is convex. Since $\frac{dP(1|x)}{dx} \geq 0\ \forall x$ or $\frac{dP(1|x)}{dx} \leq 0\ \forall x$, and $J'(P(1|x))$ changes sign only once, the derivative of $J(P(1|x))$ also changes sign only once, proving that $J(P(1|x))$ is quasi convex.

Once again, refinement in the data distribution setting $\langle P_X, J \rangle \leq 0$ is minimum when $P_X(x) = -aJ(P(1|x))$, and is maximum and equal to zero when $P_X(x) \perp J(P(1|x))$.

Assuming equal priors $P(1) = P(-1) = \frac{1}{2}$,

$$P_X(x) = \frac{P(x|1) + P(x|-1)}{2} \tag{61}$$

and

$$P(1|x) = \frac{P(x|1)}{P(x|1) + P(x|-1)}. \tag{62}$$

We can write refinement in terms of the underlying data distributions $P(x|1)$ and $P(x|-1)$ as

$$S_{\text{Refinement}} = \int \frac{P(x|1) + P(x|-1)}{2}\, J\!\left(\frac{P(x|1)}{P(x|1) + P(x|-1)}\right) dx. \tag{63}$$

For example, under the least squares $J_{LS}(P(1|x)) = 2P(1|x)(P(1|x) - 1)$, the refinement formulation simplifies to

$$S^{LS}_{\text{Refinement}} = \int \frac{-P(x|1)\, P(x|-1)}{P(x|1) + P(x|-1)}\, dx. \tag{64}$$
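For completeness, the least squares simplification can be traced step by step. Writing $P_1 = P(x|1)$ and $P_{-1} = P(x|-1)$ (a shorthand introduced here only for brevity), and substituting the equal-prior expressions for $P_X(x)$ and $P(1|x)$ into the integrand $P_X(x)\, J_{LS}(P(1|x))$, one obtains

$$\begin{aligned}
P_X(x)\, J_{LS}(P(1|x))
&= \frac{P_1 + P_{-1}}{2} \cdot 2\, \frac{P_1}{P_1 + P_{-1}} \left( \frac{P_1}{P_1 + P_{-1}} - 1 \right) \\
&= P_1 \cdot \frac{-P_{-1}}{P_1 + P_{-1}} \\
&= \frac{-P_1 P_{-1}}{P_1 + P_{-1}},
\end{aligned}$$

which is the integrand of the least squares refinement.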

Figure 3 shows the $P(x)$, $J(P(1|x))$ and $P(x)J(P(1|x))$ terms for three Gaussian distributions of unit variance and means of $\mu = \pm 0.1$, $\mu = \pm 1.5$ and $\mu = \pm 4$. In accordance with the inner product interpretation, as the means separate and the two distributions $P(x|1)$ and $P(x|-1)$ have less overlap, the refinement increases (is less negative) and approaches zero.
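The Gaussian example can be reproduced numerically with a short sketch. It assumes equal priors, unit-variance class conditionals with means $\pm\mu$, and simple trapezoidal integration on a fixed grid; the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean):
    # Unit-variance Gaussian density
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2.0 * np.pi)

def integrate(y, x):
    # Trapezoidal rule on a one-dimensional grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def ls_refinement(mu, x=np.linspace(-12.0, 12.0, 48001)):
    # Integral of -P(x|1) P(x|-1) / (P(x|1) + P(x|-1)) for class means +mu and -mu
    p_pos = gaussian_pdf(x, +mu)
    p_neg = gaussian_pdf(x, -mu)
    return integrate(-p_pos * p_neg / (p_pos + p_neg), x)

scores = [ls_refinement(mu) for mu in (0.1, 1.5, 4.0)]
print(scores)
```

With nearly identical class conditionals ($\mu = 0.1$) the score sits just above $-\frac{1}{2}$, and it moves toward zero as the means separate.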

Figure 3: Plot of the $J_{LS}$ refinement terms for three different unit variance Gaussians.

In Table 3 we have derived the refinement formulation for the different $J(P(1|x))$ of Table 4, which are plotted in Figure 4. Refinement for the zero-one maximum conditional score function

$$J_{0/1}(\eta) = \begin{cases} \eta - 1, & \text{if } \eta \geq \frac{1}{2} \\ -\eta, & \text{if } \eta < \frac{1}{2} \end{cases} \tag{65}$$

is

$$\begin{aligned}
S^{0/1}_{\text{Refinement}} &= \int P(x)\, J_{0/1}(P(1|x))\, dx \\
&= \int_{P(1|x) \geq \frac{1}{2}} \frac{P(x|1) + P(x|-1)}{2} \left( \frac{P(x|1)}{P(x|1) + P(x|-1)} - 1 \right) dx
 + \int_{P(1|x) < \frac{1}{2}} \frac{P(x|1) + P(x|-1)}{2} \left( -\frac{P(x|1)}{P(x|1) + P(x|-1)} \right) dx \\
&= -\frac{1}{2} \int_{P(1|x) \geq \frac{1}{2}} P(x|-1)\, dx - \frac{1}{2} \int_{P(1|x) < \frac{1}{2}} P(x|1)\, dx = -\frac{1}{2}(\epsilon_1 + \epsilon_2) = -\epsilon
\end{aligned} \tag{66}$$

where $\epsilon_2$ is the miss rate, $\epsilon_1$ is the false positive rate and $\epsilon$ is the Bayes error rate. In other words, refinement under the zero-one $J_{0/1}(\eta)$ is equal to minus the Bayes error. When refinement is computed under the other $J(\eta)$ of Table 4, an upper bound on the Bayes error is being computed. This can be formally written as



$$\begin{aligned}
S^{0/1}_{\text{Refinement}} - S^{J(\eta)}_{\text{Refinement}}
&= \int P_X(x)\, J_{0/1}(P(1|x))\, dx - \int P_X(x)\, J(P(1|x))\, dx \\
&= \int P_X(x) \left( J_{0/1}(P(1|x)) - J(P(1|x)) \right) dx.
\end{aligned} \tag{67}$$



In other words, the $J(\eta)$ that are closer to $J_{0/1}(\eta)$ result in refinement formulations that provide tighter bounds on the Bayes error. Figure 4 shows that $J_{LS}$, $J_{Cosh}$, $J_{Sec}$, $J_{Log}$, $J_{Log-Cos}$ and $J_{Exp}$ are, in order, the closest to $J_{0/1}$, and the corresponding refinement formulations in Table 4 provide, in the same order, tighter bounds on the Bayes error. This can also be directly verified by noting that $S_{Exp}$ is equal to the Bhattacharyya bound (Fukunaga, 1990), $S_{LS}$ is equal to the asymptotic nearest neighbor bound (Fukunaga, 1990; Cover and Hart, 1967) and $S_{Log}$ is equal to the Jensen-Shannon divergence (Lin, 1991). These three formulations have been independently studied throughout the literature and the fact that they produce upper bounds on the Bayes error has been directly verified. Here we have rederived these three measures by resorting to the concept of refinement, which not only allows us to provide a unified approach to these different methods but has also led to a systematic method for deriving novel refinement measures or bounds on the Bayes error, namely $S_{Cosh}$, $S_{Log-Cos}$ and $S_{Sec}$.
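The ordering of these bounds can be examined on a concrete case. The sketch below assumes equal priors and unit-variance Gaussian class conditionals with means $\pm\mu$, for which the Bayes error and the Bhattacharyya bound have the closed forms used in the code. It works with log densities for numerical stability; all function and variable names are illustrative.

```python
import math
import numpy as np

LOG_NORM = 0.5 * math.log(2.0 * math.pi)
C = 1.0 / (2.0 * math.log(2.0))  # roughly 0.7213

def integrate(y, x):
    # Trapezoidal rule on a one-dimensional grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def refinement_scores(mu, x=np.linspace(-12.0, 12.0, 48001)):
    # Log densities of the unit-variance class conditionals with means +mu and -mu
    log_p1 = -0.5 * (x - mu) ** 2 - LOG_NORM
    log_p2 = -0.5 * (x + mu) ** 2 - LOG_NORM
    log_sum = np.logaddexp(log_p1, log_p2)
    p1, p2 = np.exp(log_p1), np.exp(log_p2)
    return {
        # P_X(x) J(P(1|x)) written directly in terms of the class conditionals
        "zero_one": integrate(-0.5 * np.minimum(p1, p2), x),
        "ls": integrate(-np.exp(log_p1 + log_p2 - log_sum), x),
        "log": integrate(0.5 * C * (p1 * (log_p1 - log_sum) + p2 * (log_p2 - log_sum)), x),
        "exp": integrate(-0.5 * np.exp(0.5 * (log_p1 + log_p2)), x),
    }

mu = 1.0
S = refinement_scores(mu)
bayes_error = 0.5 * math.erfc(mu / math.sqrt(2.0))  # closed form for this Gaussian pair
bhattacharyya = 0.5 * math.exp(-0.5 * mu ** 2)       # closed form of the Bhattacharyya bound
print(bayes_error, {name: -value for name, value in S.items()})
```

In this setting the negated scores are expected to satisfy Bayes error $\leq -S_{LS} \leq -S_{Log} \leq -S_{Exp}$, with $-S_{0/1}$ matching the Bayes error and $-S_{Exp}$ matching the Bhattacharyya bound.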
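Table 3: Refinement measure for different $J(\eta)$.

| $J(\eta)$ | $S_{\text{Refinement}}$ |
|---|---|
| Zero-One | $-\epsilon$ (Bayes error) |
| LS | $\int_x \frac{-P(x \mid 1)\, P(x \mid -1)}{P(x \mid 1) + P(x \mid -1)}\, dx$ |
| Exp | $-\frac{1}{2} \int_x \sqrt{P(x \mid 1)\, P(x \mid -1)}\, dx$ |
| Log | $\frac{0.7213}{2} \left[ D_{KL}(P(x \mid 1) \parallel P(x \mid 1) + P(x \mid -1)) + D_{KL}(P(x \mid -1) \parallel P(x \mid 1) + P(x \mid -1)) \right]$ |
| Log-Cos | $\int_x \frac{P(x \mid 1) + P(x \mid -1)}{2} \left[ -\frac{1}{2.5854} \log\left( \frac{\cos\left( \frac{2.5854 (P(x \mid 1) - P(x \mid -1))}{2 (P(x \mid 1) + P(x \mid -1))} \right)}{\cos\left( \frac{2.5854}{2} \right)} \right) \right] dx$ |
| Cosh | $\int_x \frac{P(x \mid 1) + P(x \mid -1)}{2} \left[ \cosh\left( \frac{1.9248 (P(x \mid -1) - P(x \mid 1))}{2 (P(x \mid 1) + P(x \mid -1))} \right) - \cosh\left( \frac{-1.9248}{2} \right) \right] dx$ |
| Sec | $\int_x \frac{P(x \mid 1) + P(x \mid -1)}{2} \left[ \sec\left( \frac{1.6821 (P(x \mid -1) - P(x \mid 1))}{2 (P(x \mid 1) + P(x \mid -1))} \right) - \sec\left( \frac{-1.6821}{2} \right) \right] dx$ |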



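Table 4: $J(\eta)$ specifics used to compute the refinement score.

| Method | $J(\eta)$ |
|---|---|
| LS | $2\eta(\eta - 1)$ |
| Log | $0.7213(\eta \log(\eta) + (1 - \eta) \log(1 - \eta))$ |
| Exp | $-\sqrt{\eta(1 - \eta)}$ |
| Log-Cos | $-\frac{1}{2.5854} \log\left( \frac{\cos(2.5854(\eta - \frac{1}{2}))}{\cos(\frac{2.5854}{2})} \right)$ |
| Cosh | $\cosh(1.9248(\frac{1}{2} - \eta)) - \cosh(\frac{-1.9248}{2})$ |
| Sec | $\sec(1.6821(\frac{1}{2} - \eta)) - \sec(\frac{-1.6821}{2})$ |

Figure 4: Plot of the $J(\eta)$ in Table 4.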



9. Measures With Tighter Bounds On The Bayes Error 



Although the three novel refinement score functions discussed above provide relatively tighter upper bounds on the Bayes error, they do not produce the tightest bounds. Among the tabulated functions, $J_{LS}(\eta)$ provides the closest approximation to $J_{0/1}(\eta)$, thus resulting in a tighter bound. A natural question is whether the refinement approach can be used to derive formulations that provide even tighter bounds on the Bayes error. In order to do so, (67) states that we simply need to find $J(\eta)$ that are closer approximations to $J_{0/1}(\eta)$. In this section we derive polynomial functions $J_{Poly}(\eta)$ that are arbitrarily close approximations to $J_{0/1}(\eta)$, thus leading to measures that have the tightest bounds on the Bayes error.

The Weierstrass approximation theorem (Bartle, 1976; Burden and Faires, 2010) states that for a continuous function $f(x)$ defined on $[a, b]$ there exists a polynomial $P(x)$ that is as close to $f(x)$ as desired such that

$$|f(x) - P(x)| < \epsilon, \quad \forall x \in [a, b]. \tag{68}$$

With $J_{0/1}(\eta)$ as the target function, we demonstrate a general procedure for deriving a class of polynomial functions $J_{Poly-n}(\eta)$ that are as close to $J_{0/1}(\eta)$ as desired. As an example, we derive $J_{Poly-2}(\eta)$, which leads to the $S_{Poly-2}$ bound on the Bayes error, a tighter bound on the Bayes error than that of $J_{LS}(\eta)$. We also derive the $S_{Poly-4}$ bound, which is an even tighter bound, and show that $J_{LS} = J_{Poly-0}$.

When $J(\eta)$ is convex symmetric over $\eta \in [0, 1]$ then $J(\eta)$ is minimum at $\eta = \frac{1}{2}$ and so $J'(\frac{1}{2}) = 0$. The symmetry $J(\eta) = J(1 - \eta)$ results in a similar constraint on the second derivative, $J''(\eta) = J''(1 - \eta)$, and convexity requires that the second derivative satisfy $J''(\eta) \geq 0$. The symmetry and convexity constraints can both be satisfied by considering

$$J''_{Poly-n}(\eta) = (\eta(1 - \eta))^n \tag{69}$$

where $n$ is an even number. From this we write

$$J'_{Poly-n}(\eta) = \int (\eta(1 - \eta))^n\, d(\eta) + K_1 = Q(\eta) + K_1. \tag{70}$$
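Satisfying the constraint that $J'_{Poly-n}(\frac{1}{2}) = 0$, we find

$$K_1 = -\int (\eta(1 - \eta))^n\, d(\eta) \Big|_{\eta = \frac{1}{2}} = -Q(\tfrac{1}{2}). \tag{71}$$

Finally, $J_{Poly-n}(\eta)$ is

$$J_{Poly-n}(\eta) = K_2 \left( \int Q(\eta)\, d(\eta) + K_1 \eta \right) = K_2 (R(\eta) + K_1 \eta), \tag{72}$$

where $K_2$ is a scaling factor such that

$$K_2 = \frac{-\frac{1}{2}}{R(\frac{1}{2}) + \frac{1}{2} K_1}. \tag{73}$$

In other words, this scaling factor is set to satisfy $J_{Poly-n}(\frac{1}{2}) = J_{0/1}(\frac{1}{2}) = -\frac{1}{2}$.

As an example, we derive $J_{Poly-2}$. Following the procedure above

$$J''_{Poly-2}(\eta) = (\eta(1 - \eta))^2 = \eta^2 + \eta^4 - 2\eta^3 \geq 0. \tag{74}$$

From this we have

$$J'_{Poly-2}(\eta) = \frac{1}{3}\eta^3 + \frac{1}{5}\eta^5 - \frac{2}{4}\eta^4 + K_1. \tag{75}$$

Satisfying $J'_{Poly-2}(\frac{1}{2}) = 0$ we find $K_1 = -0.0167$. Therefore,

$$J_{Poly-2}(\eta) = K_2 \left( \frac{1}{12}\eta^4 + \frac{1}{30}\eta^6 - \frac{1}{10}\eta^5 + (-0.0167)\eta \right). \tag{76}$$

Satisfying $J_{Poly-2}(\frac{1}{2}) = -\frac{1}{2}$ we find $K_2 = 87.0196$.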

Figure 5 plots $J_{Poly-2}(\eta)$, which shows that, as expected, it is a closer approximation to $J_{0/1}(\eta)$ when compared to $J_{LS}(\eta)$. Following the same steps, it is readily shown that $J_{LS}(\eta) = J_{Poly-0}(\eta)$, meaning that $J_{LS}(\eta)$ is derived from the special case of $n = 0$. As we increase $n$, we increase the order of the resulting polynomial, which provides a tighter fit to $J_{0/1}(\eta)$. Figure 5 also plots $J_{Poly-4}(\eta)$

$$J_{Poly-4}(\eta) = 1671.3 \left( \frac{1}{90}\eta^{10} - \frac{1}{18}\eta^9 + \frac{3}{28}\eta^8 - \frac{2}{21}\eta^7 + \frac{1}{30}\eta^6 + (-7.9365 \times 10^{-4})\eta \right) \tag{77}$$

and we see that this provides an even closer approximation to $J_{0/1}(\eta)$. Table 5 shows the corresponding refinement measure for each of the $J_{Poly-n}(\eta)$ functions, with $S_{Poly-4}$ providing the tightest bound on the Bayes error. Arbitrarily tighter bounds are possible by simply using $J_{Poly-n}$ with larger $n$.
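The construction in (69) through (73) is mechanical enough to sketch with polynomial arithmetic. The helper `poly_J` and its optional `round_k1` argument are assumptions introduced for illustration. Under this sketch the unrounded $K_1 = -1/60$ for $n = 2$ gives $K_2 \approx 87.27$, while first rounding $K_1$ to $-0.0167$ reproduces the quoted $K_2 = 87.0196$.

```python
import numpy as np
from numpy.polynomial import Polynomial

def poly_J(n, round_k1=None):
    # Build J_Poly-n following (69)-(73); returns (J, K1, K2)
    eta = Polynomial([0.0, 1.0])
    second = (eta * (1.0 - eta)) ** n   # J'' = (eta (1 - eta))^n
    Q = second.integ()                  # antiderivative with zero integration constant
    K1 = -Q(0.5)                        # enforces J'(1/2) = 0
    if round_k1 is not None:
        K1 = round(float(K1), round_k1) # optional rounding, to mimic printed constants
    R = Q.integ()
    K2 = -0.5 / (R(0.5) + 0.5 * K1)     # enforces J(1/2) = -1/2
    return K2 * (R + K1 * eta), K1, K2

J0, K1_0, K2_0 = poly_J(0)
J2, K1_2, K2_2 = poly_J(2)
J4, K1_4, K2_4 = poly_J(4)
_, _, K2_2_rounded = poly_J(2, round_k1=4)

# Largest pointwise gap to the zero-one target J_{0/1}(eta) = -min(eta, 1 - eta)
grid = np.linspace(0.0, 1.0, 2001)
J01 = -np.minimum(grid, 1.0 - grid)
gaps = {name: float(np.max(np.abs(J(grid) - J01)))
        for name, J in [("Poly-0", J0), ("Poly-2", J2), ("Poly-4", J4)]}
print(K1_2, K2_2, K2_2_rounded, K1_4, K2_4)
print(gaps)
```

The printed gaps give a rough sense of how quickly the polynomial family approaches $J_{0/1}(\eta)$ as $n$ grows.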
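Figure 5: Plot of $J_{Poly-n}(\eta)$.

Table 5: Refinement measure for different $J_{Poly-n}(\eta)$. In all rows $P(x) = \frac{P(x \mid 1) + P(x \mid -1)}{2}$.

| $J(\eta)$ | $S_{\text{Refinement}}$ |
|---|---|
| Zero-One | $-\epsilon$ (Bayes error) |
| Poly-0 (LS) | $\int \frac{-P(x \mid 1)\, P(x \mid -1)}{P(x \mid 1) + P(x \mid -1)}\, dx$ |
| Poly-2 | $\frac{K_2}{2} \int \left[ \frac{P(x \mid 1)^4}{12 (2P(x))^3} + \frac{P(x \mid 1)^6}{30 (2P(x))^5} - \frac{P(x \mid 1)^5}{10 (2P(x))^4} - K_1 P(x \mid 1) \right] dx$, with $K_1 = 0.0167$, $K_2 = 87.0196$ |
| Poly-4 | $\frac{K_2}{2} \int \left[ \frac{P(x \mid 1)^{10}}{90 (2P(x))^9} - \frac{P(x \mid 1)^9}{18 (2P(x))^8} + \frac{3 P(x \mid 1)^8}{28 (2P(x))^7} - \frac{2 P(x \mid 1)^7}{21 (2P(x))^6} + \frac{P(x \mid 1)^6}{30 (2P(x))^5} - K_1 P(x \mid 1) \right] dx$, with $K_1 = 7.9365 \times 10^{-4}$, $K_2 = 1671.3$ |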



10. Conclusion 

The concept of refinement was first established in the probability elicitation literature and despite 
its close connections to proper scoring functions, has largely remained restricted to the forecasting 
literature. In this work we have revisited this important statistical measure from the viewpoint of 
machine learning. In particular, this concept is first considered from a fundamental perspective with 
the basic axioms of probability. This deeper understanding of refinement is used as a guide to ex- 
tend refinement from the original probability elicitation setting to two novel formulations namely 
the underlying data distribution and classifier output settings. These three refinement measures were 
then shown to be inner products in their respective Hilbert spaces. This unifying abstraction was 
then used to connect ideas such as maximal marginal diversity, conditional entropy, calibrated
classifiers and Bayes error. Specifically we showed that maximal marginal diversity and conditional
entropy are special cases of refinement in the underlying data distribution setting and introduced 
conditional refinement. Also a number of novel refinement measures were presented for the com- 
parison of classifiers under the classifier output setting. Finally, refinement in the underlying data 
distribution setting was used in a general procedure for deriving arbitrarily tight bounds on the 
Bayes error. 